Data Visualization and EDA

Datasets

World Happiness Report 2018 Dataset

Column Name                   Explanation
Rank                          Overall happiness ranking
Country                       Country name
Score                         Happiness score
GDP_Per_Capita                Economic contribution to happiness score
Social_Support                Social contribution to happiness score
Healthy_Life_Expectancy       Health contribution to happiness score
Freedom_To_Make_Life_Choices  Freedom contribution to happiness score
Generosity                    Generosity contribution to happiness score
Perceptions_Of_Corruption     Trustworthiness contribution to happiness score
Residual                      Portion of happiness score that is not attributed to any of the listed categories

The vec.len argument of str() controls how many of the first few elements of each vector are displayed. You can leave it at its default value; it is set to 1 here only to keep the output compact.

happy.df <- read.csv("WorldHappiness2018_Data.csv")
str(happy.df, vec.len=1)
## 'data.frame':    156 obs. of  10 variables:
##  $ Rank                        : int  1 2 ...
##  $ Country                     : Factor w/ 156 levels "Afghanistan",..: 45 106 ...
##  $ Score                       : num  7.63 ...
##  $ GDP_Per_Capita              : num  1.3 ...
##  $ Social_Support              : num  1.59 ...
##  $ Healthy_Life_Expectancy     : num  0.874 0.861 ...
##  $ Freedom_To_Make_Life_Choices: num  0.681 0.686 ...
##  $ Generosity                  : num  0.192 0.286 ...
##  $ Perceptions_Of_Corruption   : Factor w/ 111 levels "0.000","0.001",..: 107 103 ...
##  $ Residual                    : Factor w/ 145 levels "0.383","0.675",..: 141 131 ...

Reference:

  1. https://www.kaggle.com/PromptCloudHQ/world-happiness-report-2019
  2. https://worldhappiness.report/ed/2019/

Wages and Education of Young Males Dataset

Column Name  Explanation
nr           Identifier
year         Year
school       Years of schooling
exper        Years of experience (= age - 6 - school)
union        If wage is set by collective bargaining
ethn         Ethnicity
maried       If married
health       If he has health problems
wage         Log hourly wage
industry     Industry that he was in
occupation   Occupation
residence    Residence location

data("Males", package = "Ecdat") # assumption: Males is the dataset shipped with the Ecdat package
str(Males, vec.len=1)
## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 ...
##  $ year      : int  1980 1981 ...
##  $ school    : int  14 14 ...
##  $ exper     : int  1 2 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 ...
##  $ wage      : num  1.2 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 ...

NYC Flights Data in 2013

A data frame containing all 336,776 flights that departed from New York City in 2013.

Column Name                     Explanation
year, month, day                Date of departure
dep_time, arr_time              Actual departure and arrival times (format HHMM or HMM), local tz.
sched_dep_time, sched_arr_time  Scheduled departure and arrival times (format HHMM or HMM), local tz.
dep_delay, arr_delay            Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
hour, minute                    Time of scheduled departure broken into hour and minutes.
carrier                         Two letter carrier abbreviation. See the airlines table to get the name.
tailnum                         Plane tail number
flight                          Flight number
origin, dest                    Origin and destination. See the airports table for additional metadata.
air_time                        Amount of time spent in the air, in minutes
distance                        Distance between airports, in miles
time_hour                       Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

library(nycflights13) # provides the flights, airlines and airports tables
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

ggplot2: Data Visualization

The last step before exploratory data analysis (EDA) is visualization. Base R offers many tools for getting a good look at data through simple plots. However, ggplot2 is much more elegant and versatile.

Graphing Template

ggplot(data = <DATA>) + 
    <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

All ggplot2 commands can be thought of as following this template. A command starts with the ggplot function, followed by a string of geom functions. All functions are connected by +.

It is good practice to pass the common dataset to the ggplot function rather than to each geom function. In a geom function, the mapping argument expects a set of aesthetic mappings for the plot, which is normally created by the aes function. Generally, we do not need to care about the details behind this. It is sufficient to treat mapping = aes(<MAPPINGS>) as one complete unit.

Note: do not put the + sign at the beginning of a new line. The plus sign has to come at the end of a line; otherwise R treats the preceding line as a complete expression and the next layer is not added to the plot.
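
The rule about the + sign follows from how R parses code: a line that already forms a complete expression is evaluated on its own. The following is a minimal sketch of both placements, using a small made-up data frame toy (an assumption for illustration, not one of the datasets above).

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

# Correct: each line ends with +, so R keeps reading the same expression
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y))

# Incorrect (left commented out): the first line is already complete, so the
# second line is parsed as a separate expression starting with a unary +,
# which ggplot2 rejects with an error
# p <- ggplot(data = toy)
#   + geom_point(mapping = aes(x = x, y = y))

length(p$layers)  # number of layers attached to the plot
```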

One Variable

Bar Charts

geom_bar by default makes the height of the bar proportional to the number of observations in each group.

ggplot(data = Males) + 
  geom_bar(mapping = aes(x = industry, fill=maried)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) #rotate x label by 90 degrees

The count is information derived from the original data rather than a column in it. In other words, a statistical transformation, specifically counting (stat_count), happens inside the geom_bar function. The following graph is therefore identical to the one above.

ggplot(data = Males) + 
  stat_count(mapping = aes(x = industry, fill=maried)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) #rotate x label by 90 degrees
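
One way to check this equivalence is to inspect the data that each layer computes. The sketch below uses layer_data() from ggplot2 on a made-up data frame toy (hypothetical, for illustration only); it compares the computed counts rather than the rendered images.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(group = c("a", "a", "b", "c", "c", "c"))

p_bar  <- ggplot(data = toy) + geom_bar(mapping = aes(x = group))
p_stat <- ggplot(data = toy) + stat_count(mapping = aes(x = group))

# layer_data() returns the data frame that a layer actually draws
bar_counts  <- layer_data(p_bar)$count
stat_counts <- layer_data(p_stat)$count

bar_counts                          # bar heights computed by geom_bar
identical(bar_counts, stat_counts)  # whether both layers computed the same counts
```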

Position Adjustment 1: dodge

By default, bars are stacked when each group defined by the x variable is divided into subgroups by an additional variable, which is maried in this example. If you prefer to place overlapping objects side by side, pass position = "dodge" to geom_bar.

ggplot(data = Males) + 
  geom_bar(mapping = aes(x = industry, fill=maried), position="dodge") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) #rotate x label by 90 degrees

Position Adjustment 2: fill

If you prefer proportions to counts (in this example, you are more concerned with the proportion of single men in each industry than with their number), try position = "fill".

ggplot(data = Males) + 
  geom_bar(mapping = aes(x = industry, fill=maried), position="fill") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) #rotate x label by 90 degrees
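
The proportions that position = "fill" displays can also be computed by hand. This is a base R sketch on a made-up data frame toy (the rows are hypothetical and only mimic the industry and maried columns of Males).

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  industry = c("Trade", "Trade", "Trade", "Finance", "Finance"),
  maried   = c("yes", "no", "no", "yes", "yes")
)

# Counts per industry and marital status: the heights of stacked or dodged bars
counts <- table(toy$industry, toy$maried)

# Row-wise proportions: the share of each status within an industry,
# which is what position = "fill" displays
props <- prop.table(counts, margin = 1)
props
```

Each row of props sums to 1, which is why every bar in the filled chart has the same height.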

Density Plot and Histogram

ggplot(happy.df, mapping = aes(x = Healthy_Life_Expectancy)) +
  geom_density(kernel='gaussian') + 
  geom_histogram(mapping = aes(y=..density..), bins=20, alpha=0.5)

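In geom_density, the kernel argument selects the smoothing kernel used for the density estimate ("gaussian" by default). In geom_histogram, bins sets the number of bins and binwidth sets the width of each bin; binwidth overrides bins when both are given.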
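
The following sketch contrasts the bins, binwidth and kernel arguments on made-up uniform data (the data frame toy and its column value are assumptions for illustration). It inspects the computed bins with layer_data() instead of relying on the rendered plot.

```r
library(ggplot2)

set.seed(1)  # make the made-up data reproducible
toy <- data.frame(value = runif(200))

# bins fixes the number of bars; the bin width is derived from the data range
p_bins <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 10)

# binwidth fixes the width of each bar; the number of bars is derived instead
p_width <- ggplot(toy, aes(x = value)) +
  geom_histogram(binwidth = 0.25)

# kernel selects the smoothing function used by the density estimate
p_dens <- ggplot(toy, aes(x = value)) +
  geom_density(kernel = "epanechnikov")

n_bins     <- nrow(layer_data(p_bins))                           # number of bars drawn
bar_widths <- layer_data(p_width)$xmax - layer_data(p_width)$xmin  # width of each bar
```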

Two Variables

Scatterplots

Scatterplots are most useful to see the relationship between two continuous variables.

Suppose one would like to find out the relationship between money and happiness based on this dataset. The first step is usually to draw a scatterplot (dot plot) to get a sense of the trend.

As introduced previously, the dataset happy.df is passed to ggplot. The function geom_point creates the scatterplot. In this example, the function aes specifies which column goes on the x-axis and which on the y-axis.

ggplot(data = happy.df) + 
    geom_point(mapping = aes(x = GDP_Per_Capita, y = Score))

According to the data, money and happiness appear to have a fairly strong positive correlation.

Scatterplots are flexible. In addition to the location of points, marker type (shape), marker size (size), marker color (color) and transparency (alpha) can also be used to encode information.

shape can only be used to represent discrete values, while color, alpha and size are good for both discrete and continuous values. All these attributes go inside the aes function.

The following example uses transparency to represent life expectancy.

ggplot(data = happy.df) + 
    geom_point(mapping = aes(x = GDP_Per_Capita, y = Score, alpha = Healthy_Life_Expectancy))
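
Because shape accepts only discrete values, a continuous column has to be discretised before it can be mapped to shape. The sketch below shows one possible way, using base R cut() on a made-up data frame; the column names gdp, score, life and life_band and the break points are hypothetical choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  gdp   = c(0.3, 0.6, 0.9, 1.2, 1.5, 1.8),
  score = c(3.5, 4.2, 5.0, 5.6, 6.8, 7.4),
  life  = c(0.3, 0.4, 0.6, 0.7, 0.8, 0.9)
)

# Discretise the continuous variable so that it can be mapped to shape
toy$life_band <- cut(toy$life, breaks = c(0, 0.5, 1), labels = c("low", "high"))

# color takes the continuous column, shape takes the discrete one
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = gdp, y = score, color = life, shape = life_band))
```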

Attributes defined outside of mapping do not carry information from the dataset. Their values are fixed and have to be provided explicitly.

ggplot(data = happy.df) + 
    geom_point(mapping = aes(x = GDP_Per_Capita, y = Score), color = "blue", shape=7)

Ideally, scatterplots are for two continuous variables. However, they can also be used to compare one continuous variable with one discrete variable. As you can see, this is not ideal: many points overlap because they are concentrated on the limited set of experience values.

ggplot(data = Males) + 
    geom_point(mapping = aes(x = exper, y = wage, color=maried))

To mitigate this problem, set position = 'jitter' to add a small amount of random variation to the location of each point.

ggplot(data = Males) + 
    geom_point(mapping = aes(x = exper, y = wage, color=maried), position='jitter')

Alternatively, use geom_jitter, which is a convenient shortcut for geom_point(position = 'jitter').

ggplot(data = Males) + 
    geom_jitter(mapping = aes(x = exper, y = wage, color=maried))
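
The amount of jitter can be controlled through the width and height arguments of geom_jitter. The sketch below, on made-up data (the data frame toy is hypothetical), jitters only horizontally, which may be preferable when only the x variable is discrete and the y values should be drawn exactly as recorded.

```r
library(ggplot2)

set.seed(1)  # make the made-up data and the jitter reproducible

# Hypothetical toy data: x takes only three integer values, so points overlap
toy <- data.frame(x = rep(1:3, each = 20), y = rnorm(60))

# Jitter horizontally by at most 0.2 and leave the vertical positions untouched
p <- ggplot(data = toy) +
  geom_jitter(mapping = aes(x = x, y = y), width = 0.2, height = 0)

# layer_data() returns the positions after the jitter has been applied
d <- layer_data(p)
max_shift_x <- max(abs(as.numeric(d$x) - toy$x))  # largest horizontal displacement
```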

Quiz: What goes wrong in this plot?

ggplot(data = happy.df) + 
    geom_point(mapping = aes(x = GDP_Per_Capita, y = Score, color = "blue"))

Lines

ggplot(data = happy.df,  aes(x = Score, y = GDP_Per_Capita)) + 
  geom_point() + 
  geom_line()

ggplot(data = happy.df,  aes(x = Score, y = GDP_Per_Capita)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Rug Plots

A rug plot is a compact visualisation designed to supplement a 2d display with the two 1d marginal distributions. Rug plots display individual cases so are best used with smaller datasets.

ggplot(data = happy.df, 
       mapping = aes(x = GDP_Per_Capita, y = Score)) + 
    geom_point() + 
    geom_rug(sides = "bl")

Boxplots

ggplot(flights) + 
  geom_boxplot(mapping = aes(x = carrier, y = air_time), na.rm = TRUE)

How to read a boxplot?
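
In geom_boxplot, the box spans the first to the third quartile, the line inside the box is the median, and each whisker extends to the most extreme observation within 1.5 times the interquartile range (IQR) of the box. Points beyond the whiskers are drawn individually as outliers. The sketch below reproduces these numbers by hand in base R for a small made-up vector (the values are illustrative and not taken from flights).

```r
# Hypothetical air times in minutes, for illustration only
air_time <- c(50, 60, 62, 65, 70, 72, 75, 80, 150)

# Quartiles: the edges of the box and the line inside it
q   <- quantile(air_time, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(air_time[air_time >= q[["25%"]] - 1.5 * iqr])
upper_whisker <- max(air_time[air_time <= q[["75%"]] + 1.5 * iqr])

# Observations beyond the whiskers are plotted as individual points
outliers <- air_time[air_time < lower_whisker | air_time > upper_whisker]
```

Here the box runs from 62 to 75 with the median at 70, the whiskers end at 50 and 80, and 150 is the only outlier.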

Violin Plots

ggplot(flights) + 
  geom_violin(mapping = aes(x = carrier, y = air_time), na.rm = TRUE)

ggplot(flights, aes(x = carrier, y = air_time)) + 
  geom_violin(na.rm = TRUE) + 
  geom_boxplot(na.rm = TRUE, width = 0.2) # narrow boxplot drawn on top of the violin so that both stay visible

2D Density Plots

ggplot(happy.df, aes(x=Healthy_Life_Expectancy, y=GDP_Per_Capita)) + 
  geom_density2d()

Hex Plot

ggplot(happy.df, aes(x=Healthy_Life_Expectancy, y=GDP_Per_Capita)) + 
  geom_hex(binwidth=c(0.2, 0.5))

Facets: Groups of Scatterplots

Besides using aesthetic attributes such as alpha and shape to encode additional information, when dealing with categorical variables one can also split a plot into facets. The result is a group of scatterplots, each representing one group.

Suppose we would like to find out how experience affects wage for the two ethnic minority groups across 12 industries. We can create 12 scatterplots, one for each industry, instead of plotting one comprehensive scatterplot with all information, which is most likely very messy.

To create this facet, call facet_wrap() after geom_point() or geom_jitter(). The first argument of facet_wrap() is a formula. (A formula is a data structure in R, which can be seen as an expression containing ~.) ~ industry tells R to create a facet according to the levels of industry.

library(dplyr) # provides %>% and filter()
Males %>% 
  filter(ethn != "other") %>% 
  ggplot() + 
    geom_jitter(mapping = aes(x = exper, y = wage, color = ethn), alpha = 0.5) + 
    facet_wrap(~ industry, nrow = 4)

It is easy to draw some preliminary conclusions from the facet. For example, most observations are from manufacturing and trade. In the finance industry, black men in this dataset tend to earn more for the same number of years of experience.

You can also create a facet based on the combinations of levels of multiple discrete variables. To do this, put ~ between the variable names. For example, instead of encoding ethnicity as color, I create a facet based on the combination of ethnicity and industry.

Males %>% 
  filter(ethn != "other" ) %>% 
    ggplot() + 
      geom_jitter(mapping = aes(x = exper, y = wage), alpha=0.5) + 
      facet_wrap(ethn ~ industry, nrow=4)

The variables passed to facet_wrap() should be discrete, because one panel is drawn for each level or combination of levels.
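
As a small check of how a two-variable formula creates panels, the sketch below counts the panels that facet_wrap() produces for a made-up data frame (toy and its rows are hypothetical; only the column names mimic Males). facet_wrap() draws one panel for each combination of levels that occurs in the data.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  ethn     = c("black", "black", "hisp", "hisp", "black", "hisp"),
  industry = c("Trade", "Finance", "Trade", "Finance", "Trade", "Trade"),
  exper    = c(1, 2, 3, 4, 5, 6),
  wage     = c(1.2, 1.5, 1.1, 1.7, 1.4, 1.3)
)

p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = exper, y = wage)) +
  facet_wrap(ethn ~ industry)

# PANEL records which facet each point is drawn in
n_panels <- length(unique(layer_data(p)$PANEL))
n_panels  # one panel per observed (ethn, industry) combination
```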

Coordinate System

ggplot(data = Males) + 
  geom_bar(mapping = aes(x = school, fill=occupation)) +
  coord_flip()

ggplot(data = Males) + 
  geom_bar(mapping = aes(x = school, fill=occupation)) +
  coord_polar()

Geometric Objects

ggplot(data = happy.df, mapping = aes(x = GDP_Per_Capita, y = Score)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = Males, mapping = aes(x = exper, y = wage, color=maried)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Map

library(dplyr)   # provides %>%, mutate(), if_else() and left_join()
library(viridis) # provides scale_fill_viridis()
map.world <- map_data('world') # requires the maps package
# Rename countries whose names differ between the two data sources
happy.df <- happy.df %>% 
  mutate(Country = as.character(Country)) %>%
  mutate(Country = if_else(Country == "United States", 'USA', 
                   if_else(Country == "United Kingdom", 'UK', 
                           Country)))
map.df <- left_join(map.world, happy.df, by = c('region' = 'Country'))
ggplot(data = map.df, aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = Score)) + 
  scale_fill_viridis() + 
  theme_bw() + 
  labs(title = "Happiness Score by Country", subtitle = "World Happiness Report 2018")

The Google API

As of mid-2018, the Google Maps Platform requires a registered API key, so you must register one before using the Google Maps services. Go to the API registration page, check all map services and follow the instructions. The Geocoding API costs nothing as long as your usage stays within the free tier; nevertheless, you need to associate a credit card with the account.

library(ggmap) # provides register_google() and geocode()
register_google("your.api.key")
countries_loc <- geocode(c("Hong Kong", "New York, USA", "Tokyo, Japan", "London", 
                           "Singapore", "Shanghai", "Toronto", "Zurich", "Beijing",
                           "Frankfurt"))
ggplot(data = countries_loc) + 
  borders("world", fill = "grey", colour = "grey") + 
  # lon and lat are columns of countries_loc; color is set outside aes()
  # so that the points are drawn in red and no legend is created
  geom_point(mapping = aes(x = lon, y = lat), color = "red") + 
  labs(title = "Financial Centers Distribution", 
       subtitle = "According to Global Financial Centres Index (2007–ongoing)")

Summary

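Type of Plots     Geom Functions
Bar charts        geom_bar
Density plots     geom_density
Histograms        geom_histogram
Scatterplots      geom_point, geom_jitter
Lines             geom_line, geom_smooth
Rug plots         geom_rug
Boxplots          geom_boxplot
Violin plots      geom_violin
2D density plots  geom_density2d
Hex plots         geom_hex
Maps              geom_polygon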

Lutao DAI

Aug 20, 2019